Machine Learning Methods: Gender Disparities and Political Influence

Clustering and Classification for Workforce Analytics

Authors
Affiliation

Team 8

Jianhao Hong

Boston University

Xinran Li

Boston University

Chialing Sung

Boston University

Zimo Zeng

Boston University

1 🎯 Objectives

This section investigates the intersection of gender, occupation, and political geography through the lens of machine learning. The analysis is based primarily on the Lightcast Job Postings dataset (Lightcast (2024)), which contains detailed job posting information across U.S. states, occupations, and industries. To add demographic context, we merged it with gender employment statistics from the U.S. Bureau of Labor Statistics (U.S. Bureau of Labor Statistics (2023)), using occupational categories as the joining key.

Specifically, we pursue the following objectives:

  • Apply unsupervised learning (KMeans clustering) to identify natural groupings of occupations based on gender representation.
  • Build classification models to predict gender dominance using regional, occupational, and industry-level features.
  • Analyze political geography by comparing gender dominance patterns across red and blue states.
  • Visualize the distribution of gender-dominated jobs across states through maps and word clouds to uncover regional and ideological disparities.

1.0.1 🔢 Data Preparation

To explore gender disparities in occupational distribution, we combined two key data sources: the Lightcast job postings data and the BLS table of employment by occupation and sex.

We merged the datasets by mapping two-digit NAICS industry codes (NAICS2) to standard ONET occupation categories and calculating female_ratio (number of women divided by total employees) for each occupation. The final cleaned dataset contains one row per occupation, with its associated industry and gender composition.

Code
import pandas as pd

xlsx_path = "/home/ubuntu/ad688-employability-sp25A1-group8-1/data/employment_gender.xlsx"
job_posting_path = "/home/ubuntu/github-classroom/met-ad-688/assignment-03-zimozeng12/lightcast_job_postings.csv"

df_gender = pd.read_excel(xlsx_path, sheet_name="Sheet1", engine="openpyxl")
df_gender["female_ratio"] = df_gender["women"] / df_gender["total"]

df_jobs = pd.read_csv(job_posting_path, low_memory=False)
df_jobs["NAICS2"] = pd.to_numeric(df_jobs["NAICS2"], errors="coerce")

naics_to_occupation = {
    11: "Farming, fishing, and forestry occupations",
    21: "Natural resources, construction, and maintenance occupations",
    22: "Production, transportation, and material moving occupations",
    23: "Construction and extraction occupations",
    31: "Production, transportation, and material moving occupations",
    42: "Sales and office occupations",
    44: "Sales and office occupations",
    48: "Production, transportation, and material moving occupations",
    51: "Computer and mathematical occupations",
    52: "Business and financial operations occupations",
    53: "Sales and office occupations",
    54: "Professional and related occupations",
    55: "Management occupations",
    56: "Office and administrative support occupations",
    61: "Education, training, and library occupations",
    62: "Healthcare practitioners and technical occupations",
    71: "Arts, design, entertainment, sports, and media occupations",
    72: "Food preparation and serving related occupations",
    81: "Personal care and service occupations",
    92: "Public Administration",
    99: "Unclassified"
}
df_jobs["Occupation"] = df_jobs["NAICS2"].map(naics_to_occupation)

df_merged = df_jobs.merge(
    df_gender[["occupation", "female_ratio"]],
    left_on="Occupation", right_on="occupation", how="left"
)

df_cleaned = (
    df_merged[["NAICS2_NAME", "Occupation", "female_ratio"]]
    .dropna()
    .sort_values("female_ratio", ascending=False)
    .drop_duplicates(subset="Occupation", keep="first")
    .reset_index(drop=True)
)

from IPython.display import display
display(df_cleaned)
    NAICS2_NAME                                         Occupation                                         female_ratio
0   Health Care and Social Assistance                   Healthcare practitioners and technical occupat...  0.758788
1   Other Services (except Public Administration)       Personal care and service occupations              0.748341
2   Educational Services                                Education, training, and library occupations       0.727640
3   Administrative and Support and Waste Managemen...   Office and administrative support occupations      0.712298
4   Wholesale Trade                                     Sales and office occupations                       0.605601
5   Professional, Scientific, and Technical Services    Professional and related occupations               0.565025
6   Finance and Insurance                               Business and financial operations occupations      0.539946
7   Accommodation and Food Services                     Food preparation and serving related occupations   0.539016
8   Arts, Entertainment, and Recreation                 Arts, design, entertainment, sports, and media...  0.480161
9   Management of Companies and Enterprises             Management occupations                             0.419353
10  Agriculture, Forestry, Fishing and Hunting          Farming, fishing, and forestry occupations         0.270517
11  Information                                         Computer and mathematical occupations              0.268840
12  Manufacturing                                       Production, transportation, and material movin...  0.249274
13  Mining, Quarrying, and Oil and Gas Extraction       Natural resources, construction, and maintenan...  0.058216
14  Construction                                        Construction and extraction occupations            0.043041
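
Two labels in the mapping, Public Administration and Unclassified, do not appear in the cleaned table, presumably because they have no matching row in the BLS sheet and are removed by dropna(). The following is a minimal sketch, on hypothetical toy rows rather than the real data, of how pandas' merge indicator could surface such unmatched labels before they are dropped. The frame names `jobs` and `gender` are stand-ins for df_jobs and df_gender.

```python
import pandas as pd

# Toy stand-ins for df_jobs and df_gender (hypothetical rows, not the real data).
jobs = pd.DataFrame({
    "NAICS2": [62, 92, 99, 23],
    "Occupation": [
        "Healthcare practitioners and technical occupations",
        "Public Administration",
        "Unclassified",
        "Construction and extraction occupations",
    ],
})
gender = pd.DataFrame({
    "occupation": [
        "Healthcare practitioners and technical occupations",
        "Construction and extraction occupations",
    ],
    "female_ratio": [0.76, 0.04],
})

# indicator=True adds a "_merge" column recording where each row was found.
merged = jobs.merge(
    gender, left_on="Occupation", right_on="occupation",
    how="left", indicator=True,
)

# Rows found only in the job postings have no female_ratio and would be
# removed by a later dropna().
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "Occupation"])
print(unmatched)
```

Listing the unmatched labels makes it explicit which industries are excluded from the clustering that follows.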

1.0.2 Unsupervised Learning: KMeans Clustering

We begin our analysis by applying KMeans clustering to examine gender-related occupational patterns across industries. Using the female_ratio (proportion of women in each occupation) as the core feature, we explore latent structures in the labor market.

1.0.2.1 ONET Occupation Reference

We aligned job postings with ONET standard occupation classifications and used these as contextual anchors for clustering. Each occupation was mapped from NAICS industry codes to ONET categories.

1.0.2.2 Elbow Method for Optimal Clusters

To determine the ideal number of clusters (k), we used the Elbow Method, plotting inertia values (within-cluster sum of squares) against k. The curve showed a sharp drop up to k=3, after which improvements diminished. This suggests that k=3 is the most balanced choice between complexity and interpretability.

Code
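from sklearn.cluster import KMeans
import plotly.express as px

# Cluster on the single female_ratio feature.
X = df_cleaned[["female_ratio"]].values

# k=3 follows the elbow plot produced in the next cell; n_init is set
# explicitly so the result does not depend on the scikit-learn default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df_cleaned["Cluster"] = kmeans.fit_predict(X)

# Shift cluster ids from 0-2 to 1-3 for readability.
df_cleaned["Cluster"] = df_cleaned["Cluster"] + 1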
Code
import plotly.graph_objects as go
from sklearn.cluster import KMeans
import numpy as np
import plotly.io as pio

X = df_cleaned[["female_ratio"]].values

inertia = []
K_range = range(1, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(K_range),
    y=inertia,
    mode='lines+markers',
    marker=dict(size=8),
    line=dict(width=2),
    name="Inertia"
))

fig.update_layout(
    title="Elbow Method for Optimal k (KMeans)",
    xaxis_title="Number of Clusters (k)",
    yaxis_title="Inertia (Within-Cluster Sum of Squares)",
    template="plotly_white"
)

fig.write_image("_output/ml_1_kmeans.png")
fig.show()
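
Because female_ratio is a single feature, the k=3 grouping can be cross-checked without random initialization. In one dimension, the partition that minimizes the within-cluster sum of squares always consists of contiguous runs of the sorted values, so every pair of cut points can be enumerated. The sketch below is a brute-force check under that observation, not the KMeans call used in this report; the function names `sse` and `best_three_groups` are illustrative.

```python
# female_ratio values copied from the cleaned table above.
ratios = [
    0.758788, 0.748341, 0.727640, 0.712298, 0.605601,
    0.565025, 0.539946, 0.539016, 0.480161, 0.419353,
    0.270517, 0.268840, 0.249274, 0.058216, 0.043041,
]

def sse(values):
    """Sum of squared deviations from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_three_groups(values):
    """Try every way to cut the sorted values into three contiguous groups."""
    xs = sorted(values)
    best = None
    for i in range(1, len(xs) - 1):
        for j in range(i + 1, len(xs)):
            groups = [xs[:i], xs[i:j], xs[j:]]
            total = sum(sse(g) for g in groups)
            if best is None or total < best[0]:
                best = (total, groups)
    return best

total_sse, groups = best_three_groups(ratios)
group_sizes = [len(g) for g in groups]
print(group_sizes, round(total_sse, 4))
```

The best split has 5, 6, and 4 occupations from lowest to highest female ratio, which agrees with the per-cluster counts reported in the salary summary.
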
1.0.2.3 Cluster Interpretation

The resulting dataset exhibits a clear stratification of gender representation across occupational clusters:

  • Cluster 1: Female-Dominated – Includes healthcare, education, personal care, and office and administrative support, where women make up more than 70% of the workforce. These roles tend to be service-oriented and caregiving in nature.
  • Cluster 2: Mixed-Gender – Comprises fields such as sales, professional services, finance, food service, arts, and management, with female ratios between roughly 42% and 61%.
  • Cluster 3: Male-Dominated – Includes farming, computer and mathematical, production, natural resources, and construction occupations, with female ratios below 30% and as low as 4%.

This distribution aligns with well-documented patterns in labor economics literature, where occupational segregation plays a major role in shaping gender dynamics and wage inequality.

Each occupation was assigned a cluster based on its female_ratio, and visualized using a scatter plot:

Code
cluster_means = df_cleaned.groupby("Cluster")["female_ratio"].mean().sort_values()

cluster_ordered = cluster_means.index.tolist()

cluster_to_label = {
    cluster_ordered[0]: "Male-dominated",
    cluster_ordered[1]: "Mixed",
    cluster_ordered[2]: "Female-dominated"
}

df_cleaned["Cluster_Label"] = df_cleaned["Cluster"].map(cluster_to_label)
Code
fig = px.scatter(
    df_cleaned,
    x="female_ratio",
    y="Occupation",
    color="Cluster_Label",
    hover_data=["NAICS2_NAME", "Cluster"],
    title="KMeans Clustering on Female Ratio (Labeled as Gender Dominance)",
    labels={"female_ratio": "Female Ratio", "Occupation": "Occupation"},
    height=600
)

fig.update_traces(marker=dict(size=12, line=dict(width=1, color='DarkSlateGrey')))
fig.update_layout(template="plotly_white", legend_title_text="Gender Dominance")
fig.write_image("_output/ml_2_clustering.png")
fig.show()

This unsupervised learning step allows us to categorize occupations systematically and provides a foundation for further supervised modeling and political analysis.

1.0.3 Salary Summary by Gender-Dominance Clusters

After clustering occupations by gender composition, we analyzed salary disparities across these clusters. Salaries were first averaged within each occupation, then summarized by cluster (mean, median, and count) and rounded to the nearest dollar.

Code
df_jobs["Occupation"] = df_jobs["NAICS2"].map(naics_to_occupation)

df_salary = df_jobs[["Occupation", "SALARY", "SALARY_FROM", "SALARY_TO", "NAICS2_NAME"]]

df_salary_grouped = (
    df_salary
    .groupby("Occupation")[["SALARY", "SALARY_FROM", "SALARY_TO"]]
    .mean()
    .reset_index()
)

df_cleaned_salary = df_cleaned.merge(df_salary_grouped, on="Occupation", how="left")

salary_summary = (
    df_cleaned_salary
    .groupby("Cluster_Label")[["SALARY", "SALARY_FROM", "SALARY_TO"]]
    .agg(["mean", "median", "count"])
    .round()
)

display(salary_summary)
                  SALARY                       SALARY_FROM                  SALARY_TO
Cluster_Label     mean      median    count    mean      median    count    mean      median    count
Female-dominated  94055.0   96352.0   4        79493.0   79391.0   4        103683.0  105710.0  4
Male-dominated    115422.0  115134.0  5        94387.0   95448.0   5        131923.0  127799.0  5
Mixed             118009.0  115698.0  6        92756.0   89902.0   6        137956.0  134247.0  6
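
The cluster means above are means of occupation-level means, so each occupation counts equally no matter how many postings it has. The sketch below uses hypothetical postings to show how this two-step average can differ from a posting-level average. The occupation names and salaries are made up for illustration.

```python
import pandas as pd

# Hypothetical postings: occupation A has three postings, occupation B has one.
postings = pd.DataFrame({
    "Occupation": ["A", "A", "A", "B"],
    "Cluster_Label": ["Mixed"] * 4,
    "SALARY": [60000, 60000, 60000, 100000],
})

# Two-step mean, as in the summary above: average within each occupation,
# then average those occupation means within the cluster.
occupation_means = postings.groupby(["Cluster_Label", "Occupation"])["SALARY"].mean()
mean_of_means = occupation_means.groupby(level="Cluster_Label").mean()

# One-step mean over all postings in the cluster.
posting_level_mean = postings.groupby("Cluster_Label")["SALARY"].mean()

print(mean_of_means["Mixed"], posting_level_mean["Mixed"])
```

Which of the two is appropriate depends on whether the question is about typical occupations or typical postings.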

1.0.4 🔍 Key Insights:

  • Female-dominated occupations have the lowest average posted salary ($94,055), well below the male-dominated ($115,422) and mixed ($118,009) clusters.

  • The highest-earning group by mean SALARY is the “Mixed” cluster, suggesting that gender-integrated occupations may offer more competitive wages.

  • The wage gap is nontrivial: male-dominated jobs pay about $21,000 more than female-dominated ones on average.

  • The female-dominated cluster is also the lowest on SALARY_FROM and SALARY_TO, although on SALARY_FROM the male-dominated cluster (mean $94,387) edges out the mixed cluster ($92,756).
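
The dollar gaps quoted in this list follow directly from the mean SALARY column of the summary table. The short sketch below does that arithmetic; the dictionary simply restates the reported means.

```python
# Mean SALARY per cluster, copied from the summary table above.
mean_salary = {
    "Female-dominated": 94055.0,
    "Male-dominated": 115422.0,
    "Mixed": 118009.0,
}

baseline = mean_salary["Female-dominated"]

# For each other cluster: (gap in dollars, gap as a percentage of the baseline).
gaps = {
    label: (value - baseline, (value - baseline) / baseline * 100)
    for label, value in mean_salary.items()
    if label != "Female-dominated"
}

for label, (dollars, percent) in gaps.items():
    print(f"{label}: +${dollars:,.0f} ({percent:.1f}% above female-dominated)")
```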

Code
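import plotly.express as px

# Average the occupation-level salaries within each cluster first.
# Passing the raw rows to px.bar would stack them and plot their sum.
df_cluster_salary = (
    df_cleaned_salary
    .groupby("Cluster_Label", as_index=False)["SALARY"]
    .mean()
)

fig = px.bar(
    df_cluster_salary,
    x="Cluster_Label",
    y="SALARY",
    color="Cluster_Label",
    title="Average Salary by Gender Cluster",
    labels={"SALARY": "Average Salary"},
    height=500
)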
fig.update_layout(template="plotly_white", showlegend=False)
fig.write_image("_output/ml_3_gender_cluster.png")
fig.show()

1.0.5 📊 Visualization

The bar chart above visualizes these differences:

  • Female-dominated roles sit in the lowest salary range.

  • Mixed-gender roles reach the highest salaries.

  • Male-dominated roles sit in between, but still well above the female-dominated cluster.

We observe a consistent pattern in salary distribution by cluster: occupations classified as female-dominated tend to have the lowest compensation, while mixed-gender and male-dominated fields show significantly higher average and maximum salaries. These findings echo prior literature on occupational segregation and wage inequality (Blau and Kahn (2017)).

2 🗳️ Political Influence: Red vs. Blue States and Gender-Dominated Jobs

Building upon the gender clustering analysis, we now shift our lens to political geography—exploring how gender-dominated occupations are distributed across red and blue states in the U.S.

To enable this, we added a STATE_NAME field to each occupation in our salary-enhanced dataset and manually mapped each state to its political leaning:

  • 🟥 Red states: Texas, Florida, Alabama, Mississippi, Tennessee
  • 🟦 Blue states: California, New York, Massachusetts, Illinois, Washington

We then classified each job cluster by state, and grouped the counts of female-dominated, male-dominated, and mixed occupations per political alignment.

Code
import pandas as pd
import plotly.express as px

us_state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
    'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC'
}

blue_states = ["California", "New York", "Massachusetts", "Illinois", "Washington"]
red_states = ["Texas", "Florida", "Alabama", "Mississippi", "Tennessee"]

df_cleaned_salary = df_cleaned_salary.merge(
    df_jobs[["Occupation", "STATE_NAME"]].drop_duplicates(),
    on="Occupation",
    how="left"
)

df_cleaned_salary["State_Political"] = df_cleaned_salary["STATE_NAME"].apply(
    lambda x: "Blue" if x in blue_states else ("Red" if x in red_states else None)
)

df_polarized = df_cleaned_salary[df_cleaned_salary["State_Political"].notna()].copy()

summary = (
    df_polarized.groupby(["State_Political", "Cluster_Label"])
    .size()
    .unstack()
    .fillna(0)
    .astype(int)
)
Code
# Convert to percentages before rounding to avoid floating-point residue.
summary_percent = (summary.div(summary.sum(axis=1), axis=0) * 100).round(1)
display(summary_percent)
Cluster_Label    Female-dominated  Male-dominated  Mixed
State_Political
Blue             27.4              31.5            41.1
Red              29.0              30.4            40.6

Despite slight variations, both red and blue states show remarkably similar distributions across the three clusters.
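
The merge above joins on distinct (Occupation, STATE_NAME) pairs, so each count in summary is the number of occupation categories observed in a state, not the number of postings. The sketch below illustrates the difference on hypothetical data. The frames `jobs` and `clusters` are toy stand-ins for df_jobs and the clustered occupation table.

```python
import pandas as pd

# Hypothetical postings: several Texas postings in one occupation, few elsewhere.
jobs = pd.DataFrame({
    "Occupation": ["Management"] * 4 + ["Construction"],
    "STATE_NAME": ["Texas", "Texas", "Texas", "California", "Texas"],
})
clusters = pd.DataFrame({
    "Occupation": ["Management", "Construction"],
    "Cluster_Label": ["Mixed", "Male-dominated"],
})

# Same pattern as above: keep one row per (occupation, state) pair, then merge.
pairs = jobs[["Occupation", "STATE_NAME"]].drop_duplicates()
merged = clusters.merge(pairs, on="Occupation", how="left")
pair_counts = merged.groupby(["STATE_NAME", "Cluster_Label"]).size()

# Alternative: attach the cluster label to every posting and count postings.
posting_counts = (
    jobs.merge(clusters, on="Occupation")
    .groupby(["STATE_NAME", "Cluster_Label"])
    .size()
)

print(pair_counts.loc[("Texas", "Mixed")], posting_counts.loc[("Texas", "Mixed")])
```

With pair counts, a state's cluster shares depend only on which occupation categories appear there at all, which may partly explain why the red and blue distributions look so alike.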

2.0.1 📉 Visual Representation

To better understand the comparison, we visualize the absolute counts for each cluster in red and blue states. Each count is a distinct occupation-state pair, not a number of postings:

Code
df_plot = df_polarized.groupby(["State_Political", "Cluster_Label"]).size().reset_index(name="Count")

fig = px.bar(
    df_plot,
    x="State_Political",
    y="Count",
    color="Cluster_Label",
    barmode="stack",
    title="Gender-Dominated Job Clusters by State Political Leaning",
    height=500
)

fig.update_layout(template="plotly_white")
fig.write_image("_output/ml_4_gender-dominated_cluster.png")
fig.show()

2.0.2 Key Takeaways

  • Mixed-gender occupations dominate in both political groups.

  • Red states exhibit a slightly higher proportion of female-dominated roles, but the difference is marginal.

  • The overall gender cluster landscape is relatively stable across political lines, suggesting that broader economic structures may drive occupational gender distributions more than politics alone.
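
One way to put a number on "marginal" is a chi-square test of independence between political leaning and cluster. The sketch below is illustrative only. The counts are an assumption: they are one set of integers consistent with the rounded percentages shown earlier (the actual counts live in the summary DataFrame). It also treats each occupation-state pair as an independent observation, which is a simplification.

```python
import math

# One set of counts consistent with the rounded percentages above
# (an assumption; substitute the real values from `summary`).
observed = {
    "Blue": [20, 23, 30],   # Female-dominated, Male-dominated, Mixed
    "Red": [20, 21, 28],
}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for group, counts in observed.items():
    for col, obs in enumerate(counts):
        expected = row_totals[group] * col_totals[col] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# A 2x3 table has (2-1)*(3-1) = 2 degrees of freedom, and for 2 degrees of
# freedom the chi-square upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumptions the statistic is close to zero, consistent with the reading that the two distributions are not distinguishable.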

2.1 Geographic Visualization of Gender Cluster Dominance

To further investigate the regional disparities, we visualized the state-level proportion of gender-dominated jobs using choropleth maps.

Code
df_state_cluster = (
    df_cleaned_salary
    .groupby(["STATE_NAME", "Cluster_Label"])
    .size()
    .unstack(fill_value=0)
    .reset_index()
)

df_state_cluster["Total"] = (
    df_state_cluster["Female-dominated"]
    + df_state_cluster["Male-dominated"]
    + df_state_cluster["Mixed"]
)

df_state_cluster["female_ratio"] = df_state_cluster["Female-dominated"] / df_state_cluster["Total"]
df_state_cluster["male_ratio"] = df_state_cluster["Male-dominated"] / df_state_cluster["Total"]
df_state_cluster["mixed_ratio"] = df_state_cluster["Mixed"] / df_state_cluster["Total"]

df_state_cluster["STATE_ABBR"] = df_state_cluster["STATE_NAME"].map(us_state_abbrev)

df_state_cluster["Political"] = df_state_cluster["STATE_NAME"].apply(
    lambda x: "Blue" if x in blue_states else ("Red" if x in red_states else "Other")
)

2.1.1 Female-Dominated Jobs

  • Female-dominated jobs are relatively more concentrated in the Midwest and Northeast.
Code
import plotly.express as px

fig = px.choropleth(
    df_state_cluster,
    locations="STATE_ABBR",
    locationmode="USA-states",
    color="female_ratio",
    hover_name="STATE_NAME",
    hover_data=["female_ratio", "Political"],
    color_continuous_scale=px.colors.sequential.Pinkyl,
    title="Proportion of Female-Dominated Jobs by State",
    scope="usa"
)

fig.update_layout(template="plotly_white")
fig.write_image("_output/ml_5_female-dominated_state.png")
fig.show()

2.1.2 Male-Dominated Jobs

  • Male-dominated jobs show higher proportions in Southern and industrial regions, especially in states like West Virginia and Kentucky.
Code
fig = px.choropleth(
    df_state_cluster,
    locations="STATE_ABBR",
    locationmode="USA-states",
    color="male_ratio",
    hover_name="STATE_NAME",
    hover_data=["male_ratio", "Political"],
    color_continuous_scale=px.colors.sequential.Blues,
    title="Proportion of Male-Dominated Jobs by State",
    scope="usa"
)
fig.update_layout(template="plotly_white")
fig.write_image("_output/ml_6_male-dominated_state.png")
fig.show()

2.1.3 Mixed-Gender Jobs

  • Mixed-gender clusters are more evenly distributed but slightly higher in Western and Northern states.
Code
fig = px.choropleth(
    df_state_cluster,
    locations="STATE_ABBR",
    locationmode="USA-states",
    color="mixed_ratio",
    hover_name="STATE_NAME",
    hover_data=["mixed_ratio", "Political"],
    color_continuous_scale=px.colors.sequential.Oranges,
    title="Proportion of Mixed-Gender Jobs by State",
    scope="usa"
)
fig.update_layout(template="plotly_white")
fig.write_image("_output/ml_7_mixed_state.png")
fig.show()

These insights suggest that political and cultural climates may indirectly influence the gender composition of regional labor markets, potentially through policy, education access, or industry presence.

2.2 State-Level Dominance by Gender Cluster

Beyond proportions, we identified the most dominant gender cluster per state by counting the number of job categories that fall into each cluster.

Code
df_state_dominance = (
    df_cleaned_salary
    .groupby(["STATE_NAME", "Cluster_Label"])
    .size()
    .unstack(fill_value=0)
)

df_state_dominance["Dominant_Label"] = df_state_dominance.idxmax(axis=1)

df_state_dominance["STATE_ABBR"] = df_state_dominance.index.map(us_state_abbrev)
df_state_dominance["Political"] = df_state_dominance.index.map(
    lambda x: "Blue" if x in blue_states else ("Red" if x in red_states else "Other")
)

df_state_dominance = df_state_dominance.reset_index()
Code
import plotly.express as px

fig = px.choropleth(
    df_state_dominance,
    locations="STATE_ABBR",
    locationmode="USA-states",
    color="Dominant_Label",
    hover_name="STATE_NAME",
    hover_data=["Dominant_Label", "Political"],
    scope="usa",
    color_discrete_map={
        "Female-dominated": "#FF69B4",
        "Male-dominated": "#1E90FF",
        "Mixed": "#90EE90"
    },
    title="Dominant Gender Cluster by State (KMeans Result)"
)

fig.update_layout(template="plotly_white")
fig.write_image("_output/ml_8_dominated_state.png")
fig.show()

2.2.1 Interpretation:

  • Most states are green (mixed-dominated), indicating a balanced gender structure.
  • States such as Vermont and North Dakota are pink, meaning female-dominated job categories are the most numerous there.
  • Kentucky and West Virginia appear in blue, dominated by male-leaning occupations.

This high-level overview of occupational dominance by gender reveals that while most states exhibit balanced patterns, specific regions still lean toward traditionally gendered industries. These differences may reflect deeper structural factors such as industrial composition, education access, or sociopolitical norms.
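
One caveat when reading the map: idxmax returns the first column that attains the row maximum, so when two clusters tie in a state, the label that comes first in column order wins. After unstack the columns are in alphabetical order, so Female-dominated takes precedence. The sketch below shows this on hypothetical counts and adds a simple tie flag. The state names and numbers are made up.

```python
import pandas as pd

# Hypothetical per-state counts of job categories in each cluster.
counts = pd.DataFrame(
    {
        "Female-dominated": [4, 3, 2],
        "Male-dominated": [3, 5, 1],
        "Mixed": [4, 5, 6],
    },
    index=["State A", "State B", "State C"],
)

# idxmax picks the first column holding the row maximum, so ties are
# resolved by column order rather than reported.
dominant = counts.idxmax(axis=1)

# Flag rows where more than one cluster shares the maximum count.
is_tie = counts.eq(counts.max(axis=1), axis=0).sum(axis=1) > 1

print(pd.DataFrame({"Dominant_Label": dominant, "tie": is_tie}))
```

States flagged as ties could be shown in a separate color or excluded from claims about a single dominant cluster.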

2.3 Visual Summary: Word Cloud of State-Level Gender Dominance

To provide a more engaging summary of the state-level dominant gender clusters, we generated a word cloud where:

  • Each state name appears with a font size proportional to the number of job categories in its dominant gender cluster.
  • Colors represent the type of dominance:
    • 💗 Pink for Female-dominated
    • 💙 Blue for Male-dominated
    • 💚 Green for Mixed-gender

This visualization effectively condenses both the magnitude of dominance (via size) and gender pattern (via color) across all U.S. states.

Code
from wordcloud import WordCloud
import matplotlib.pyplot as plt

color_map = {
    "Female-dominated": "#FF69B4",
    "Male-dominated": "#1E90FF",
    "Mixed": "#90EE90"
}

word_freq = {}
word_colors = {}

for _, row in df_state_dominance.iterrows():
    state = row["STATE_NAME"]
    dominant = row["Dominant_Label"]
    size = row[dominant]
    word_freq[state] = size
    word_colors[state] = color_map.get(dominant, "gray")

def color_func(word, *args, **kwargs):
    return word_colors.get(word, "gray")

wc = WordCloud(
    width=1000,
    height=600,
    background_color="white",
    color_func=color_func
).generate_from_frequencies(word_freq)

plt.figure(figsize=(14, 7))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("States by Dominant Gender Cluster (Word Size = Job Category Count)", fontsize=16)
# Save the matplotlib figure; `fig` still refers to the earlier Plotly choropleth.
plt.savefig("_output/ml_9_word.png")
plt.show()

2.3.1 Insights:

  • California, Arizona, and Arkansas are among the most prominent states with large mixed-gender clusters.
  • West Virginia and Kentucky, though smaller in job volume, are clearly male-dominated states.
  • A few female-dominated states (like Vermont and Rhode Island) appear subtly but distinctly in pink.

This word cloud serves as a high-level yet informative synthesis of our gender cluster findings across the U.S. geography.

Together, these visualizations illustrate that gender disparity in employment is not only occupationally segmented but also spatially structured, suggesting that:

  • States with differing political climates show subtle distinctions in gender job dominance.
  • Geographic analysis provides essential context when interpreting labor market gender gaps.

These findings highlight the need for region-specific workforce policies that acknowledge both political realities and industry composition.

References

Blau, F. D., and L. M. Kahn. (2017): “The gender wage gap: Extent, trends, and explanations,” Journal of Economic Literature, 55, 789–865.
Lightcast. (2024): “Lightcast Job Postings Dataset,” https://drive.google.com/file/d/1VNBTxArDMN2o9fJBDImaON6YUAyJGOU6/view.
U.S. Bureau of Labor Statistics. (2023): “Employed Persons by Detailed Occupation, Sex, Race, and Hispanic or Latino Ethnicity,” https://www.bls.gov/cps/data/aa2023/cpsaat09.htm.